Abstract

Posterior matching (PM) is a sequential horizon-free feedback communication scheme introduced by the authors, 
who also provided a rather involved optimality proof showing it achieves capacity for a large class of memoryless 
channels. Naghshvar et al. considered a non-sequential variation of PM with a fixed number of messages and a
random decision-time, and gave a simpler proof establishing its optimality via a novel Shannon-Jensen divergence 
argument. Another simpler optimality proof was given by Li and El Gamal, who considered a fixed-rate fixed 
block-length variation of PM with an additional randomization. Both these works also provided error exponent 
bounds. However, their simpler achievability proofs apply only to discrete memoryless channels, and are restricted 
to a non-sequential setup with a fixed number of messages. In this paper, we provide a short and transparent proof 
for the optimality of the fully sequential horizon-free PM scheme over general memoryless channels. Borrowing 
the key randomization idea of Li and El Gamal, our proof is based on analyzing the random walk behavior of the 
shrinking posterior intervals induced by a reversed iterated function system (RIFS) decoder.


I. Introduction 

Posterior Matching (PM) is a simple and general feedback communication scheme introduced by the authors,
who also showed it achieves capacity for a large class of memoryless channels, including discrete alphabets,
continuous alphabets, and mixtures thereof [1]-[3]. One appealing feature of the PM scheme is that it is
horizon-free and sequential, in the sense that the transmitter may send an infinite sequence of bits, and the
receiver can decide to stop at every instant $n$; the receiver is then able to decode roughly $nC$ bits from the
prefix of this sequence with vanishing error probability, where $C$ is the capacity of the channel. Alternatively,
the receiver is also able to decode the bits on the fly as soon as they become reliable enough. As argued in [1],
PM can easily be converted to the more traditional settings where the number of messages and/or the horizon are fixed.

While heuristic arguments for the optimality of PM are simple and appealing (see [1], and going back to
the special case of the Horstein scheme [4], [5]), the original optimality proof in [1] is quite involved and
nontransparent. Coleman [6] studied the PM scheme from a novel stochastic control and Lyapunov exponent
perspective, and provided a conceptually cleaner approach for its analysis. Naghshvar et al. [7] considered a
non-sequential variation of PM restricted to discrete memoryless channels (DMCs), where the number of messages is
fixed but the decision time (horizon) is random. Introducing a novel Shannon-Jensen divergence, they provided
a simpler proof showing that their scheme achieves the capacity of any DMC. Li and El Gamal [8] considered
the same setting but with a fixed horizon. They described a randomized variation of PM and provided a simpler
proof showing it achieves the capacity of any DMC. A key ingredient in their scheme was a random shift applied
to the message point after each PM iteration, which circumvented some of the analysis obstacles. Both [7] and
[8] also provide error exponent results.

In this paper, we adopt the random shift idea of Li and El Gamal, and consider a randomized version of the
fully sequential horizon-free PM scheme. We provide a short and transparent optimality proof, showing that this
scheme achieves the capacity for a very large class of memoryless channels, including all DMCs and also many
continuous alphabet and mixed alphabet channels. Our proof is based on analyzing the random-walk behavior of
a reversed iterated function system (RIFS) decoder introduced in [1]. Unlike the deterministic PM scheme in [1],
the combination of RIFS decoding and the random shift operation facilitates a much cleaner analysis and avoids
the problem of fixed points that was a major obstacle in the original proof.

The authors are with the Department of EE-Systems, Tel Aviv University, Tel Aviv, Israel {ofersha@eng.tau.ac.il, meir@eng.tau.ac.il}. 
The work of O. Shayevitz was supported by the Israel Science Foundation, grant no. 1367/14. 
II. Preliminaries

A. Definitions and Basic Lemmas 

Recall that a real-valued stochastic process $T_n$ is called a submartingale if $\mathbb{E}(T_{n+1} \mid T^n) \ge T_n$ for any $n$. The
following result is well known.

Lemma 1 (Martingale Convergence Theorem [9]). Let $T_n$ be a submartingale. If $\sup_n \mathbb{E}|T_n| < \infty$ then $T_n$
converges a.s. to some r.v. $T$ and $\mathbb{E}|T| < \infty$.

Let $g : [0,1] \to \mathbb{R}$ be a Lebesgue measurable function. With some abuse of notation, we naturally extend $g$ to
operate on subsets of its domain in an element-wise fashion, namely $g(A) = \bigcup_{x \in A} \{g(x)\}$ for any set $A \subseteq [0,1]$.
We write $|A|$ for the Lebesgue measure of the set $A$, whenever the former exists. Define the $\lambda$-smoothed derivative
of $g$ to be

$$ D_\lambda[g(x)] = \frac{1}{\lambda} \left| g\left( \left[ x - \tfrac{\lambda}{2},\, x + \tfrac{\lambda}{2} \right] \bmod 1 \right) \right|, $$

where $t \bmod 1 = t - \lfloor t \rfloor$ is the modulo 1 operation.* Let

$$ D[g(x)] = \limsup_{\lambda \to 0} D_\lambda[g(x)]. $$

The following lemma is easily verified. 

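Lemma 2. If $g(x)$ is differentiable at $x_0 \in (0,1)$ with a derivative $g'(x_0)$, then $D[g(x_0)] = |g'(x_0)|$. Furthermore,
if $g$ is absolutely continuous on $[0,1]$, then

$$ D_\lambda[g(x)] = \mathbb{E}\,\big| g'\big( (x + Q_\lambda) \bmod 1 \big) \big|, $$

where $Q_\lambda \sim \mathrm{Unif}([-\lambda/2, \lambda/2])$.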
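As a small numerical illustration of the $\lambda$-smoothed derivative, the following Python sketch evaluates it for a nondecreasing map of the unit interval onto itself. The function names and the choice $g(x) = x^2$ are assumptions made here purely for illustration; they do not appear in the paper.

```python
def smoothed_derivative(g, x, lam):
    """lambda-smoothed derivative of a nondecreasing g with g(0)=0, g(1)=1.

    Returns |g([x - lam/2, x + lam/2] mod 1)| / lam: the Lebesgue measure
    of the image of the (possibly wrapped) interval, divided by lam.
    """
    if not 0.0 < lam <= 1.0:
        raise ValueError("lam must lie in (0, 1]")
    if lam == 1.0:
        return 1.0  # the interval is the whole circle; its image has measure 1
    a = (x - lam / 2.0) % 1.0
    b = (x + lam / 2.0) % 1.0
    if a < b:                       # no wrap-around
        image = g(b) - g(a)
    else:                           # wraps: [a, 1] union [0, b]
        image = (g(1.0) - g(a)) + (g(b) - g(0.0))
    return image / lam


def square(x):
    # Illustrative test function (an assumption, not from the paper).
    return x * x


print(smoothed_derivative(square, 0.5, 0.2))    # interior point, no wrap
print(smoothed_derivative(square, 0.95, 0.2))   # interval wraps past 1
```

For the interior point the result should be close to $g'(0.5) = 1$, while for the wrapped interval it should be close to $1.4$, which agrees with averaging $|g'|$ over $[0.85, 1] \cup [0, 0.05]$ as in Lemma 2.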

Now, further define

$$ \mathcal{D}[g(x)] = \sup_{\lambda \in (0,1)} D_\lambda[g(x)]. $$

When $g$ is absolutely continuous and monotonic (which will be our case of interest), $\mathcal{D}[g(x)]$ is the maximal
stretching of any symmetric interval (modulo 1) around $x$ by $g$. The following lemma is a consequence of the
Hardy-Littlewood maximal inequality [10], and states that $\mathcal{D}[g(x)]$ is unlikely to be too large, provided that $g$ is
well behaved. The proof is relegated to the appendix.

Lemma 3. Let $g : [0,1] \to \mathbb{R}$ be absolutely continuous on $[0,1]$, and $X \sim \mathrm{Unif}([0,1])$. Then for any $a > 0$,

$$ \Pr\big( \mathcal{D}[g(X)] > a \big) \le 9 a^{-1}\, \mathbb{E}|g'(X)|. $$

Remark 1. Note that if $g$ is Lipschitz (which corresponds in the sequel to the case of discrete alphabet channels),
then a stronger asymptotic statement trivially holds: $\Pr(\mathcal{D}[g(X)] > a) = 0$ for all $a$ large enough.

Let $(X,Y) \sim P_{XY}$ be jointly distributed real-valued random variables. Let $F_X$ be the c.d.f. of $X$, and $F_X^{-1}$
be its functional inverse, generally defined by

$$ F_X^{-1}(u) = \inf\{x : F_X(x) > u\}. $$

It is easy to verify (see e.g. [1]) that we can always define an auxiliary r.v. $\Theta \sim \mathrm{Unif}([0,1])$ such that $X = F_X^{-1}(\Theta)$.

This induces a joint distribution $P_{\Theta XY}$. Let $F_{\Theta|Y}(\theta \mid y)$ denote the conditional c.d.f. of $\Theta$ given $Y$, also known
as the PM kernel [1]. We will also be interested in the inverse PM kernel $F^{-1}_{\Theta|Y}(v \mid y)$, which is the functional
inverse of the PM kernel w.r.t. $\theta$ [1].

*One may equivalently identify [0,1) with the circle R/Z, in lieu of the modulo notation. The cyclic definition of the smoothed derivative 
takes care of what happens near the edges of the unit interval, and is essential for our purposes later due to the random shift. The definition 
(and associated results in this section) work with minor adaptations for any other interval domains (with the proper modulo) or when the 
domain is R (without the modulo). 




In the remainder of the paper, we restrict our attention to the family $\mathcal{F}$ of all distributions $P_{XY}$
admitting the following two properties:

(P1) $F_{\Theta|Y}(\theta \mid y)$ (resp. $F^{-1}_{\Theta|Y}(v \mid y)$) is absolutely continuous and strictly monotone in $\theta \in [0,1]$ (resp. $v \in [0,1]$)
for $P_Y$-a.a. $y$.

(P2) There exists some $\delta > 0$ such that

$$ \limsup_{\lambda \to 0} \mathbb{E}\left| \log D_\lambda\big[ F^{-1}_{\Theta|Y}(V \mid Y) \big] \right|^{2+\delta} < \infty, $$

where $Y \sim P_Y$ and $V \sim \mathrm{Unif}([0,1])$ are independent, and the $\lambda$-smoothed derivative is taken w.r.t. $v$.

Remark 2. The family $\mathcal{F}$ is quite rich and includes all discrete distributions, as well as many continuous and
mixed alphabet distributions. See Remark 3 following Theorem 1.

The following claims are readily verified. 

Lemma 4. Suppose $P_{XY}$ satisfies property (P1). Then

(i) $\frac{\partial}{\partial v} F^{-1}_{\Theta|Y}(v \mid y) = 1 / f_{\Theta|Y}\big( F^{-1}_{\Theta|Y}(v \mid y) \mid y \big)$ for $P_Y$-a.a. $y$.

(ii) $I(X;Y) = I(\Theta;Y) < \infty$.

Finally, we say that a r.v. $X$ is stochastically smaller than another r.v. $Y$, if $\Pr(Y < a) \le \Pr(X < a)$ for
any $a$. More generally, we say that $X$ is stochastically smaller than $Y$ given some event $A$, if $\Pr(Y < a \mid A) \le \Pr(X < a)$ for any $a$.

B. Setup 

We are concerned with the following feedback communication setup. A transmitter is in possession of a message
point $\Theta_0 \sim \mathrm{Unif}([0,1])$, its binary expansion representing an infinite i.i.d. uniform bit sequence to be reliably
communicated to a receiver over a memoryless channel $P_{Y|X}$. The input and output of the channel at time $n$ are
denoted $X_n$ and $Y_n$ respectively. We assume there is a noiseless instantaneous feedback link from the receiver back
to the transmitter, so that at time $n$ the transmitter is in possession of $Y^{n-1}$. The memoryless channel model means
that $Y_n$ is independent of $(Y^{n-1}, \Theta_0)$ given $X_n$, and that $Y_n \mid X_n = x_n \sim P_{Y|X}(\cdot \mid x_n)$. Furthermore, we
assume the transmitter and the receiver share some common randomness; specifically, we assume they can jointly
draw an i.i.d. sequence $\{V_n \sim \mathrm{Unif}([0,1])\}_{n=1}^{\infty}$, where $V_n$ is statistically independent of $(\Theta_0, X^n, Y^n, V^{n-1})$.

A (sequential, horizon-free) transmission scheme is an infinite sequence of mappings that determine the next
channel input $X_{n+1}$ as a function of $(\Theta_0, Y^n, V^n)$. A decoding rule is a corresponding sequence of functions
that map $(Y^n, V^n)$ to an interval (modulo 1) $J_n$, in which the receiver believes the message point lies. The
error probability attained by a scheme and a decoding rule at time $n$ is $p_e(n) = \Pr(\Theta_0 \notin J_n)$, and the associated
instantaneous rate is $R_n = -\frac{1}{n} \log |J_n|$. The relation to decoding actual bits is simple: Identifying the said interval
of size $2^{-nR_n}$ essentially guarantees that the $nR_n$ most significant bits of $\Theta_0$ can be decoded with error probability
$p_e(n)$, up to technical edge issues that can be easily resolved (see [1]). A transmission scheme is said to attain a
rate $R$, if for any target error probability $p_e > 0$ there is a suitable decoding rule such that $\Pr(R_n > R) \to 1$
as $n \to \infty$. In the following two subsections we describe a simple and optimal construction of a transmission
scheme and decoding rule, namely the randomized PM scheme with RIFS decoding.

C. Randomized Posterior Matching 

Let $P_{Y|X}$ be a memoryless channel law, and set some input distribution $P_X$ (say, capacity achieving under
some input constraint). Consider the following recursively defined transmission scheme:

$$ \begin{aligned} \Theta_1 &= \Theta_0 \\ X_n &= F_X^{-1}(\Theta_n) \\ \Theta_{n+1} &= \big( F_{\Theta|Y}(\Theta_n \mid Y_n) + V_n \big) \bmod 1 \end{aligned} \tag{1} $$

The scheme in (1) will be referred to as the randomized PM scheme. Note that for $V_n = 0$ this coincides with
the classical PM scheme [1]. The randomization idea is key to our simplified analysis, and is due to Li and El
Gamal [8] who analyzed a non-sequential fixed-rate fixed-block-length version of this scheme in a DMC setting.
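To make the recursion concrete, here is a minimal Python sketch of one iteration of the randomized PM scheme for an illustrative special case: a binary symmetric channel with crossover probability $p$ and a uniform input. The channel choice, the value $p = 0.1$, and all function names are assumptions of this sketch; for this channel the PM kernel is piecewise linear with slopes $2(1-p)$ and $2p$.

```python
import random


def pm_kernel(theta, y, p):
    """Posterior c.d.f. F_{Theta|Y}(theta | y) for a BSC(p) with uniform input.

    The input bit is x = 0 for theta < 1/2 and x = 1 otherwise, so the
    posterior density of Theta is 2(1 - p) where x == y and 2p elsewhere.
    """
    lo = 2.0 * (1.0 - p) if y == 0 else 2.0 * p   # density on [0, 1/2)
    hi = 2.0 - lo                                  # density on [1/2, 1]
    if theta < 0.5:
        return lo * theta
    return 0.5 * lo + hi * (theta - 0.5)


def randomized_pm_step(theta, p, rng):
    """One iteration of the recursion: returns (x, y, v, next_theta)."""
    x = 0 if theta < 0.5 else 1                    # X_n = F_X^{-1}(Theta_n)
    y = x ^ (1 if rng.random() < p else 0)         # BSC output Y_n
    v = rng.random()                               # shared random shift V_n
    next_theta = (pm_kernel(theta, y, p) + v) % 1.0
    return x, y, v, next_theta


rng = random.Random(0)
theta = rng.random()                               # message point Theta_0
for n in range(1, 6):
    x, y, v, theta = randomized_pm_step(theta, p=0.1, rng=rng)
    print(n, x, y, round(theta, 6))
```

In an actual system the shift $V_n$ would come from the common randomness shared with the receiver; here a single generator plays both roles for simplicity.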
We recall a few known properties of PM that are also inherited by its randomized sibling, with minor
modifications accounting for common randomness. The proofs follow easily from the associated claims in [1],
e.g. by thinking of $(Y_n, V_n)$ as the channel output, and are omitted.

Lemma 5. The randomized PM scheme satisfies the following:

(i) $\Theta_n \sim \mathrm{Unif}([0,1])$, $X_n \sim P_X$, and $Y_n \sim P_Y$.

(ii) $\Theta_n$ (and hence $X_n$) is statistically independent of $(Y^{n-1}, V^{n-1})$.

(iii) $\{Y_n\}$ and $\{V_n\}$ are mutually independent i.i.d. sequences.

(iv) $I(\Theta_0; Y_n \mid Y^{n-1}, V^n) = I(X;Y)$.

(v) $I(\Theta_0; Y^n \mid V^n) = n I(X;Y)$.


D. Reversed Iterated Function System (RIFS) Decoding 

In this subsection we describe a decoding rule for the randomized PM, that maps $(Y^n, V^n)$ into an interval that is
guaranteed to contain the message point $\Theta_0$ up to a prescribed error probability (see [1] for more details). Let
$F^{-1}_{\Theta|Y}(v \mid y)$ be the inverse PM kernel, i.e.,

$$ F^{-1}_{\Theta|Y}(v \mid y) = \inf\{\theta : F_{\Theta|Y}(\theta \mid y) > v\}. $$

Set some target error probability $p_e > 0$, and let $J_0 \subset (0,1)$ be an interval of size $|J_0| = 1 - p_e$. The RIFS
decoder outputs the interval $J_n$ defined recursively by

$$ J_{k+1} = F^{-1}_{\Theta|Y}\big( (J_k - V_{n-k}) \bmod 1 \,\big|\, Y_{n-k} \big) \tag{2} $$

for $k = 0, \ldots, n-1$. Recall that we effectively identify $[0,1)$ with the circle $\mathbb{R}/\mathbb{Z}$, hence we allow wrap-around
intervals, i.e., the interval $(a, b)$ for $a > b$ is the union $(a, 1) \cup [0, b)$.

Lemma 6 ([1]). The probability of error incurred by the above RIFS decoder is $\Pr(\Theta_0 \notin J_n) = p_e$.
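Proof:

\begin{align}
\Pr(\Theta_0 \in J_n) &= \Pr(\Theta_1 \in J_n) \notag \\
&= \mathbb{E} \Pr(\Theta_1 \in J_n \mid Y_1, V_1) \notag \\
&= \mathbb{E} \Pr(\Theta_2 \in J_{n-1} \mid Y_1, V_1) \tag{3} \\
&= \Pr(\Theta_2 \in J_{n-1}) \tag{4} \\
&= \cdots \tag{5} \\
&= \Pr(\Theta_{n+1} \in J_0) \notag \\
&= 1 - p_e. \tag{6}
\end{align}

(3) follows since (1) is invertible given $Y_{n-k}, V_{n-k}$, by virtue of property (P1). (4) follows since by Lemma 5,
$\Theta_{k+1}$ is independent of $(Y^k, V^k)$. In (5) we iterate the same arguments, and (6) holds by definition. ■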
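The chain of equalities above rests on the fact that the message point lies in the decoded interval exactly when the final state of the encoder recursion lies in the initial interval. The following Python sketch checks this identity for one illustrative case, a binary symmetric channel with uniform input, using exact rational arithmetic so that very short intervals are represented without rounding. The channel, the parameter values, the half-open arc convention, and all names are assumptions of this sketch.

```python
import random
from fractions import Fraction

HALF = Fraction(1, 2)


def pm_kernel(theta, y, p):
    """F_{Theta|Y}(theta | y) for a BSC(p) with uniform input (piecewise linear)."""
    lo = 2 * (1 - p) if y == 0 else 2 * p          # slope on [0, 1/2)
    hi = 2 - lo                                     # slope on [1/2, 1]
    return lo * theta if theta < HALF else lo * HALF + hi * (theta - HALF)


def inv_pm_kernel(v, y, p):
    """Functional inverse of pm_kernel in its first argument."""
    lo = 2 * (1 - p) if y == 0 else 2 * p
    hi = 2 - lo
    return v / lo if v < lo * HALF else HALF + (v - lo * HALF) / hi


def contains(start, length, point):
    """True if point lies in the arc [start, start + length) on the circle R/Z."""
    return (point - start) % 1 < length


def simulate(n, p, p_e, rng, bits=64):
    """Run n rounds of the encoder recursion, then decode backwards (RIFS)."""
    theta0 = Fraction(rng.getrandbits(bits), 2 ** bits)
    theta, ys, vs = theta0, [], []
    for _ in range(n):
        x = 0 if theta < HALF else 1
        y = x ^ (1 if rng.random() < p else 0)
        v = Fraction(rng.getrandbits(bits), 2 ** bits)
        theta = (pm_kernel(theta, y, p) + v) % 1
        ys.append(y)
        vs.append(v)
    start, length = p_e / 2, 1 - p_e                # J_0, centered in (0, 1)
    hit_j0 = contains(start, length, theta)         # is Theta_{n+1} in J_0 ?
    for k in range(n):                              # J_{k+1} from J_k
        y, v = ys[n - 1 - k], vs[n - 1 - k]         # Y_{n-k}, V_{n-k}
        a = (start - v) % 1
        b = (a + length) % 1
        start = inv_pm_kernel(a, y, p)
        length = (inv_pm_kernel(b, y, p) - start) % 1
    return theta0, start, length, hit_j0


rng = random.Random(0)
theta0, start, length, hit_j0 = simulate(40, Fraction(1, 10), Fraction(1, 10), rng)
print(contains(start, length, theta0) == hit_j0)    # expected: True
print(float(length))                                # |J_n| for this particular run
```

Exact fractions are used because floating-point endpoints can no longer be told apart once the interval becomes shorter than machine precision.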

Define the sequence of contraction terms:

$$ L_k = \log \frac{|J_{k-1}|}{|J_k|}, \qquad k = 1, \ldots, n, $$

and set $L_0 = -\log(1 - p_e)$. Define further

$$ R_n \triangleq \frac{1}{n} \sum_{k=0}^{n} L_k. $$

From the discussion above it is clear that the RIFS decoder outputs an interval of (random) size $2^{-nR_n}$ in which
$\Theta_0$ is guaranteed to lie with probability $1 - p_e$. Therefore, $R_n$ is the (random) instantaneous rate of randomized
PM under RIFS decoding with error probability $p_e$. In what follows, we will be interested in guarantees on $R_n$.
As we shall see, in many cases $R_n$ becomes arbitrarily close (for any target $p_e$) to the optimal value $I(X;Y)$
with high probability as $n$ grows large. Thus, randomized PM can achieve any rate up to channel capacity.
III. Main Result

We state our main result, showing that under very mild regularity conditions the randomized PM scheme with 
RIFS decoding achieves any rate below the mutual information. 

Theorem 1. Let $(X,Y) \sim P_{XY} \in \mathcal{F}$ and assume that $0 < I(X;Y) < \infty$. Then for any target error probability
$p_e$ and any $\varepsilon > 0$, the decoding rate achieved by the associated randomized PM scheme with RIFS decoding
satisfies

$$ \lim_{n \to \infty} \Pr\big( R_n > I(X;Y) - \varepsilon \big) = 1. $$

Remark 3. The conditions in the theorem are very general, and specifically hold in the following cases:

• For any discrete memoryless channel with any input distribution such that $I(X;Y) > 0$. In this case [1]
the PM kernel is a quasi-linear function in $\theta$ for any fixed $y$, with slopes corresponding to the conditional
distributions of $x$ given $y$.

• When the conditional p.d.f. $f_{X|Y}(x \mid y)$ exists, is bounded, and has bounded support, for any $y$.

• For any additive noise channel $Y = X + Z$ where $Z$ is independent of $X$, both $Z$ and $Y$ have bounded
p.d.f.s, and either:

- $f_Z(z), f_Y(y)$ have bounded supports; or,

- fz{z) > /^(y) > and < cx) for some ki,k 2 > 0. This includes in 

particular the additive Gaussian channel with a Gaussian input, where the scheme essentially reduces to the
well known Schalkwijk-Kailath scheme [11], [12]. Note that this subfamily also includes mixed alphabet
channels, e.g. binary input and additive Gaussian noise, etc.

Remark 4. The original PM optimality result (no randomization) requires the posterior matching kernel to be
free of any fixed points [1]. It was further shown in [1] that the existence of such fixed points is possible, and
that in such a case no positive rate can be attained, unless a suitable input transformation is applied. We note that
the randomized PM does not suffer from this issue; the fixed point problem is "washed away" by the random
shifting operation.


IV. Proof of Main Result 


A. Proof Sketch 

Before we proceed to formally prove Theorem 1, we give a heuristic argument that captures the essence of
the proof. Let $S_n = \sum_{k=0}^{n} L_k$ be the sum of the contraction terms at time $n$, so that $S_n = nR_n$. First, note that
if we fix the horizon $n$, the process $\{S_k\}_{k=0}^{n}$ is a Markov chain in the time index $k$. Alas, the stochastic process
$S_n$ is not a Markov chain in the horizon parameter $n$, since the RIFS process evolves backward in time (see [1] for more
details). However, since we are only interested in the asymptotic (marginal) behavior of $S_n$ as the horizon $n$
grows unbounded, then instead of fixing the horizon $n$ and analyzing the process $S_k$, we can assume the horizon
is infinite and think of $S_n$ as a Markov chain for any $n \in \mathbb{N}$ (with some abuse of notation, where we replaced $S_k$
with $S_n$). The associated processes $L_n$ and $J_n$ will be indexed by $n$ as well. In other words, we are effectively
thinking of the decoding process going forward in time, instead of backward.

How does the process $S_n$ evolve? At time $n$, imagine we are in possession of some random interval $J_n$ of
size $|J_n| = 2^{-S_n}$, corresponding to the interval the RIFS holds after $n$ backward iterations. The position of $J_n$ is
uniformly distributed over the unit interval modulo 1, due to the random shift operation. We independently draw
a r.v. $Y_n \sim P_Y$ (recalling that the output sequence is i.i.d.), and apply the inverse PM kernel to obtain the next
interval $J_{n+1} = F^{-1}_{\Theta|Y}(J_n \mid Y_n)$, which is then randomly shifted modulo 1. This procedure yields the update

$$ S_{n+1} = S_n + L_n, \quad \text{where } L_n = \log \frac{|J_n|}{|J_{n+1}|}. $$

The process $S_n$ is thus a Markovian random walk on $\mathbb{R}_+$, starting from $S_0 = -\log(1 - p_e)$, with the contraction
terms $L_n$ as its increments.

Now, assume that $S_n$ is already very large, i.e. that the associated interval size $|J_n|$ is very small. What is the
increment $L_n$ in this case? Clearly, $J_n$ will shrink (or stretch) by a (random) factor that is roughly the derivative
of $F^{-1}_{\Theta|Y}(v \mid y)$ w.r.t. $v$, evaluated for $y = Y_n$ and at $v$ that is (say) the random midpoint $V_n$ of $J_n$, which is
$\sim \mathrm{Unif}([0,1])$ and independent of $Y_n$. By Lemma 4 claim (i), this derivative is equal to $1 / f_{\Theta|Y}(F^{-1}_{\Theta|Y}(v \mid y) \mid y)$.
The contraction term is hence roughly $\log f_{\Theta|Y}(F^{-1}_{\Theta|Y}(V_n \mid Y_n) \mid Y_n)$. Defining $\Theta = F^{-1}_{\Theta|Y}(V_n \mid Y_n)$, it is readily
verified that $(\Theta, Y_n) \sim P_{\Theta Y}$ as induced by $P_{XY}$ and $X = F_X^{-1}(\Theta)$ (see Lemma 7). Thus, we conclude that
when $S_n$ is large, the contraction term $L_n$ has distribution close to that of the r.v. $\log f_{\Theta|Y}(\Theta \mid Y)$, and hence
$\mathbb{E}L_n \approx I(\Theta;Y) = I(X;Y)$. Thus, as long as $S_n$ does not become too small, it grows like the sum of roughly
i.i.d. random variables with expectation $I(X;Y)$, which is why we expect $S_n$ to be close to $nI(X;Y)$.

Of course, the devil is in the details. The main technical challenge is to bound the behavior of the chain for
small $S_n$, in which case the contraction terms behave quite differently; in contrast to the case of a large $S_n$ where
the distribution of the contraction terms is essentially independent of the actual value of $S_n$, here this distribution
strongly depends on the exact position of the random walk. More specifically, instead of being the logarithm of
the derivative of the inverse PM kernel, the contraction terms in the "small" regime correspond to the logarithm of
the $\lambda$-smoothed derivative of the inverse PM kernel, with a smoothing factor of $\lambda = 2^{-S_n}$. In the next subsection,
we deal with these difficulties: First, we show that $S_n$ spends overall little time in the "small" regime (note that
it can go back and forth between "large" and "small"). Then, we couple the process $S_n$ with a simpler process
$S'_n$ that has only two modes of i.i.d. behavior, corresponding to whether $S_n$ is "small" or "large". We show that
the contribution of the "small" mode of $S'_n$ is negligible, and that consequently $S'_n$ is close to $nI(X;Y)$ with
high probability. The proof is then completed by observing that $S'_n$ is stochastically smaller than $S_n$.

B. Detailed Proof 

In this subsection we prove Theorem 1. We use the definition of $S_n$ as a Markovian random walk on $\mathbb{R}_+$, with
the time arrow going forward instead of backward, as described in the previous subsection. Define the random
variable

$$ L^{(\lambda)} \triangleq -\log D_\lambda\big[ F^{-1}_{\Theta|Y}(V \mid Y) \big], \tag{7} $$

where $Y \sim P_Y$ and $V \sim \mathrm{Unif}([0,1])$ are independent. Clearly, the distribution of $L^{(\lambda)}$ is the same as the
distribution of the contraction factor $L_n$ given that $S_{n-1} = -\log \lambda$.

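We begin by proving two lemmas characterizing the behavior of $L^{(\lambda)}$.

Lemma 7. Let $\Theta = F^{-1}_{\Theta|Y}(V \mid Y)$. Then $(\Theta, Y) \sim P_{\Theta Y}$ and

$$ \lim_{\lambda \to 0} L^{(\lambda)} = \log \frac{f_{\Theta|Y}(\Theta \mid Y)}{f_\Theta(\Theta)} \quad \text{a.s.} $$

Proof: By assumption (P1), Lemma 2, and Lemma 4 claim (i), we have that given $V = v$ and $Y = y$

\begin{align}
\lim_{\lambda \to 0} -\log D_\lambda\big[ F^{-1}_{\Theta|Y}(v \mid y) \big] &= -\log \left| \frac{\partial}{\partial v} F^{-1}_{\Theta|Y}(v \mid y) \right| \notag \\
&= \log f_{\Theta|Y}\big( F^{-1}_{\Theta|Y}(v \mid y) \mid y \big) \notag \\
&= \log \frac{f_{\Theta|Y}\big( F^{-1}_{\Theta|Y}(v \mid y) \mid y \big)}{f_\Theta\big( F^{-1}_{\Theta|Y}(v \mid y) \big)} \notag
\end{align}

for $P_{VY}$-a.a. $(v, y)$, where the last step follows trivially since $f_\Theta(\theta) = 1$ for any $\theta \in (0,1)$. It follows that
$L^{(\lambda)}$ converges a.s. to the random variable $\log f_{\Theta|Y}(\Theta \mid Y)$, where $\Theta$ is defined in the Lemma. Now

\begin{align}
\Pr(\Theta \le \theta \mid Y = y) &= \Pr\big( F^{-1}_{\Theta|Y}(V \mid Y) \le \theta \mid Y = y \big) \notag \\
&= \Pr\big( V \le F_{\Theta|Y}(\theta \mid y) \mid Y = y \big) \tag{8} \\
&= F_{\Theta|Y}(\theta \mid y), \tag{9}
\end{align}

where (8) holds due to the strict monotonicity of the PM kernel under assumption (P1), and (9) follows since $Y$
and $V$ are independent. Hence, $(\Theta, Y) \sim P_{\Theta Y}$ according to the joint distribution induced by $(P_X, P_{Y|X})$. This
completes the proof. ■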
Lemma 8. $L^{(\lambda)}$ satisfies the following properties:

(i) $\mathbb{E}L^{(\lambda)}$ is continuous in $\lambda$ over $[0,1]$.

(ii) $\lim_{\lambda \to 1} \mathbb{E}L^{(\lambda)} = 0$.

(iii) $\lim_{\lambda \to 0} \mathbb{E}L^{(\lambda)} = I(X;Y)$.

(iv) If $I(X;Y) > 0$ then $0 < \mathbb{E}L^{(\lambda)} < I(X;Y)$ for any $\lambda \in (0,1)$.

Proof: The first claim follows easily from assumption (P1), by the continuity of the inverse PM kernel. The
second claim holds since $F^{-1}_{\Theta|Y}(\cdot \mid y)$ maps the unit interval to itself for any $y$. Let us prove the third claim. By
property (P2) of the family $\mathcal{F}$, there must exist some $\lambda_0 > 0$ such that $\{L^{(\lambda)}\}_{\lambda \in (0,\lambda_0)}$ are bounded in $\mathcal{L}_p$ for
$p = 2 + \delta > 1$. Hence $\{L^{(\lambda)}\}_{\lambda \in (0,\lambda_0)}$ are uniformly integrable. By Lemma 7, $L^{(\lambda)}$ also converges a.s. to a finite
limit. Thus, by Vitali's convergence theorem [10], we can change the order of limit and expectation, i.e.,

\begin{align}
\lim_{\lambda \to 0} \mathbb{E}L^{(\lambda)} &= \mathbb{E} \lim_{\lambda \to 0} L^{(\lambda)} \notag \\
&= \mathbb{E} \log \frac{f_{\Theta|Y}(\Theta \mid Y)}{f_\Theta(\Theta)} \notag \\
&= I(\Theta;Y) \notag \\
&= I(X;Y), \notag
\end{align}

where we have used Lemma 4 claim (ii) in the last step.

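For the fourth claim, note that we can write

$$ L^{(\lambda)} = -\log \mathbb{E}_Q \Big( 1 / f_{\Theta|Y}\big( F^{-1}_{\Theta|Y}((V + Q) \bmod 1 \mid Y) \mid Y \big) \Big), $$

where $Q \sim \mathrm{Unif}([-\lambda/2, \lambda/2])$ is independent of $V, Y$. We therefore have that

\begin{align}
\mathbb{E}L^{(\lambda)} &= \mathbb{E}_{V,Y} L^{(\lambda)} \notag \\
&= -\mathbb{E}_{V,Y} \log \mathbb{E}_Q \Big( 1 / f_{\Theta|Y}\big( F^{-1}_{\Theta|Y}((V + Q) \bmod 1 \mid Y) \mid Y \big) \Big) \notag \\
&< \mathbb{E}_{V,Y} \mathbb{E}_Q \log f_{\Theta|Y}\big( F^{-1}_{\Theta|Y}((V + Q) \bmod 1 \mid Y) \mid Y \big) \tag{10} \\
&= \mathbb{E}_{V',Y} \log f_{\Theta|Y}\big( F^{-1}_{\Theta|Y}(V' \mid Y) \mid Y \big) \notag \\
&= \mathbb{E}_{\Theta,Y} \log f_{\Theta|Y}(\Theta \mid Y) \tag{11} \\
&= I(\Theta;Y) \notag \\
&= I(X;Y), \tag{12}
\end{align}

where $V' = (V + Q) \bmod 1$ is uniform over the unit interval. We have used Jensen's inequality in (10), which
is strict since $\lambda > 0$ and $I(\Theta;Y) > 0$. (11) follows from Lemma 7, and (12) follows again from Lemma 4 claim
(ii). Similarly,

\begin{align}
\mathbb{E}L^{(\lambda)} &= -\mathbb{E}_{V,Y} \log \mathbb{E}_Q \Big( 1 / f_{\Theta|Y}\big( F^{-1}_{\Theta|Y}((V + Q) \bmod 1 \mid Y) \mid Y \big) \Big) \notag \\
&> -\log \mathbb{E}_{V,Y} \mathbb{E}_Q \Big( 1 / f_{\Theta|Y}\big( F^{-1}_{\Theta|Y}((V + Q) \bmod 1 \mid Y) \mid Y \big) \Big) \tag{13} \\
&= -\log \mathbb{E}_{V',Y} \Big( 1 / f_{\Theta|Y}\big( F^{-1}_{\Theta|Y}(V' \mid Y) \mid Y \big) \Big) \notag \\
&= -\log \mathbb{E}_{\Theta,Y} \big( 1 / f_{\Theta|Y}(\Theta \mid Y) \big) \notag \\
&= -\log \mathbb{E}_Y \mathbb{E}_{\Theta|Y} \big( 1 / f_{\Theta|Y}(\Theta \mid Y) \big) \notag \\
&= -\log \mathbb{E}_Y 1 \notag \\
&= 0, \tag{14}
\end{align}

where (13) again follows from Jensen's inequality. ■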

Using the properties of $L^{(\lambda)}$ established above, we would like to show that $S_n$ spends little time close to the
origin. To that end, we first prove the following lemma.

Lemma 9. $S_n$ is a submartingale on $\mathbb{R}_+$, and $\Pr(\limsup_{n \to \infty} S_n = \infty) = 1$.







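Proof: The submartingale claim follows immediately from Lemma 8 property (iv). Let us prove the other
claim. Recall that by Lemma 8, $\mathbb{E}L^{(\lambda)}$ is a continuous function of $\lambda$ over $[0,1]$, and $0 < \mathbb{E}L^{(\lambda)} < I(X;Y)$ for
any $\lambda \in (0,1)$, where the upper and lower bounds are approached as $\lambda$ tends to zero and one respectively. It is
therefore easy to construct a two-sided monotonically decreasing sequence $\{\lambda_k\}_{k=-\infty}^{\infty}$ with $\lim_{k \to -\infty} \lambda_k = 1$
and $\lim_{k \to \infty} \lambda_k = 0$ such that

$$ \inf_{\lambda \in [\lambda_{k+1}, \lambda_k)} \mathbb{E}L^{(\lambda)} > 3 \log \frac{\lambda_k}{\lambda_{k+1}} \tag{15} $$

for any $k$. Hence,

\begin{align}
q \triangleq \inf_{\lambda \in [\lambda_{k+1}, \lambda_k)} \Pr\left( L^{(\lambda)} > \log \frac{\lambda_k}{\lambda_{k+1}} \right)
&\ge \inf_{\lambda \in [\lambda_{k+1}, \lambda_k)} \Pr\left( L^{(\lambda)} > \frac{1}{3} \inf_{\lambda' \in [\lambda_{k+1}, \lambda_k)} \mathbb{E}L^{(\lambda')} \right) \tag{16} \\
&\ge \inf_{\lambda \in [\lambda_{k+1}, \lambda_k)} \Pr\left( L^{(\lambda)} > \frac{1}{3} \mathbb{E}L^{(\lambda)} \right) \tag{17} \\
&> 0, \tag{18}
\end{align}

where (16) follows from (15), choosing $\lambda' = \lambda$ establishes (17), and (18) trivially holds since $\mathbb{E}L^{(\lambda)} > 0$ on any
closed subinterval of $(0,1)$.

Let $\{T_{j,k}\}_{j=1}^{T_k}$ be the sequence of all time indices $n$ where $S_n \in (-\log \lambda_k, -\log \lambda_{k+1}]$, where $T_k$ is the (possibly
infinite) total number of such occurrences. Let $M_k$ be the maximal time index $n$ for which $S_n > -\log \lambda_{k+1}$, and
let $b$ be some fixed positive integer. Then

\begin{align}
\Pr\Big( \limsup_{n \to \infty} S_n \in (-\log \lambda_k, -\log \lambda_{k+1}] \Big) &= \Pr(M_k < \infty, T_k = \infty) \notag \\
&\le \Pr(M_k < \infty, T_k > M_k + b) \notag \\
&= \sum_{m=0}^{\infty} \Pr(T_k > m + b \mid M_k = m) \Pr(M_k = m) \notag \\
&\le \sum_{m=0}^{\infty} \Pr\left( L_{T_{j,k}} \le \log \frac{\lambda_k}{\lambda_{k+1}},\ m < j < m + b \ \Big|\ M_k = m \right) \Pr(M_k = m) \notag \\
&\le \sum_{m=0}^{\infty} (1 - q)^b \Pr(M_k = m) \notag \\
&\le (1 - q)^b. \notag
\end{align}

Since $q > 0$, and as the above upper bound holds for any $b$ and $k$, it must be that

$$ \Pr\Big( \limsup_{n \to \infty} S_n \in (-\log \lambda_k, -\log \lambda_{k+1}] \Big) = 0. $$

The proof is now concluded by noting that $\mathbb{R}_+ = \bigcup_k (-\log \lambda_k, -\log \lambda_{k+1}]$. ■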

We now further strengthen Lemma 9 and show that $S_n$ in fact diverges a.s., which will specifically show that
it spends little time below any threshold $t$. Let $N_{t,n}$ be the number of times $S_k$ falls below $t$ until time $n$, i.e.,

$$ N_{t,n} \triangleq \sum_{k=1}^{n} \mathbb{1}(S_k < t), $$

and let $N_t = \lim_{n \to \infty} N_{t,n}$ be a random variable on $\mathbb{N} \cup \{\infty\}$.

Lemma 10. $S_n \to \infty$ almost surely, hence $\Pr(N_{t,n} > m) \le \Pr(N_t > m) = \delta(m)$ where $\delta(m) \to 0$ as $m \to \infty$.

Proof: The proof is based on arguments similar to [1]. Consider the process $T_n = 1 - \frac{1}{1 + S_n}$. Below we
show that $T_n$ converges a.s., which together with Lemma 9 implies that $T_n \to 1$ a.s. and hence $S_n \to \infty$
a.s., establishing the lemma.
First, we show it is sufficient to prove that there exists some $t_0 \in (0,1)$ such that $\mathbb{E}(T_{n+1} \mid T_n = t) \ge t$
for any $t \ge t_0$. To see that, define the process $T^*_n = \max(T_n, t_0)$, and note that by definition it holds that
$\mathbb{E}(T^*_{n+1} \mid T^*_n = t) \ge t$ for any $t$, hence $T^*_n$ is a submartingale. Moreover, $\mathbb{E}|T^*_n| \le 1$ for all $n$. By Lemma 1 it must
therefore be that $T^*_n$ converges a.s. to a limit. Since $\Pr(\limsup_{n \to \infty} T^*_n = 1) \ge \Pr(\limsup_{n \to \infty} T_n = 1) = 1$,
this limit must be 1, i.e., $T^*_n \to 1$ a.s. Since $T_n = T^*_n$ whenever $T_n > t_0$, it must be that $T_n \to 1$ a.s. as well.

It remains to show the existence of such a $t_0$. Let us first establish some guarantees on the first and second
moments of $L^{(\lambda)}$ conditioned on an event that $L^{(\lambda)} > a$ for some $a$. From Lemma 8 we know that $\mathbb{E}L^{(\lambda)}$
approaches $I(X;Y) > 0$ continuously as $\lambda \to 0$, hence in particular there is some $c_1 > 0$ such that $\mathbb{E}L^{(\lambda)} > c_1$
for all $\lambda > 0$ small enough. Trivially, it also holds that for any $a$

$$ \mathbb{E}\big( L^{(\lambda)} \mid L^{(\lambda)} > a \big) \ge \mathbb{E}L^{(\lambda)} > c_1 > 0 \tag{19} $$

for any $\lambda > 0$ small enough. Moreover, property (P2) of the family $\mathcal{F}$ implies that $L^{(\lambda)}$ is uniformly bounded in
$\mathcal{L}_2$ for all $\lambda > 0$ small enough, hence $\mathbb{E}|L^{(\lambda)}|^2 < c_2$ for some $c_2 < \infty$. Trivially then, for any $a$ it also
holds that

$$ \Pr\big( L^{(\lambda)} > a \big) \cdot \mathbb{E}\big( |L^{(\lambda)}|^2 \mid L^{(\lambda)} > a \big) \le \mathbb{E}|L^{(\lambda)}|^2 < c_2 < \infty \tag{20} $$

for all $\lambda > 0$ small enough.

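Now, define the function $g(s,\ell) = \frac{1}{1+s} - \frac{1}{1+s+\ell}$. Since the process $S_n$ is nonnegative, we can clearly limit our
discussion to $\ell \ge -s$, and hence to $g(s,\ell) \ge -1$. Let us write

\begin{align}
g(s,\ell) &= \frac{\ell}{(1+s)^2 + \ell(1+s)} \notag \\
&= \frac{\ell}{(1+s)^2} - \frac{\ell^2}{(1+s)^3 + \ell(1+s)^2}. \notag
\end{align}

Setting any $\alpha \in (0,1)$, it therefore holds that for any $\ell \ge -(1+s)^\alpha$ and $s > 2^{\frac{1}{1-\alpha}} - 1$,

\begin{align}
g(s,\ell) &\ge \frac{\ell}{(1+s)^2} - \frac{\ell^2}{(1+s)^3 - (1+s)^{2+\alpha}} \notag \\
&\ge \frac{\ell}{(1+s)^2} - \frac{2\ell^2}{(1+s)^3}, \tag{21}
\end{align}

where the last step uses $(1+s)^{2+\alpha} \le \frac{1}{2}(1+s)^3$, which holds for this range of $s$.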
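The quadratic lower bound on the increment of $T_n$ can be sanity-checked numerically. The Python sketch below evaluates $g(s,\ell) = \frac{1}{1+s} - \frac{1}{1+s+\ell}$ against the expression $\frac{\ell}{(1+s)^2} - \frac{2\ell^2}{(1+s)^3}$ on a grid of $\ell \ge -(1+s)^\alpha$ for several $s$ above the threshold $2^{1/(1-\alpha)} - 1$. The grid, the tolerance, the test values of $s$ and $\alpha$, and the function names are assumptions of this sketch.

```python
def g(s, l):
    """Increment of T = 1 - 1/(1 + S) when S moves from s to s + l."""
    return 1.0 / (1.0 + s) - 1.0 / (1.0 + s + l)


def quadratic_lower_bound(s, l):
    """l/(1+s)^2 - 2 l^2/(1+s)^3, compared against g below."""
    return l / (1.0 + s) ** 2 - 2.0 * l ** 2 / (1.0 + s) ** 3


def bound_holds(alpha, s, steps=2000, l_max=50.0):
    """Check g >= bound on a grid of l in [-(1+s)^alpha, l_max]."""
    l_min = -((1.0 + s) ** alpha)
    for i in range(steps + 1):
        l = l_min + (l_max - l_min) * i / steps
        if g(s, l) < quadratic_lower_bound(s, l) - 1e-12:
            return False
    return True


alpha = 0.5
s_min = 2.0 ** (1.0 / (1.0 - alpha)) - 1.0       # threshold 2^{1/(1-alpha)} - 1
print(all(bound_holds(alpha, s) for s in (s_min + 0.01, 10.0, 100.0, 1e4)))
```

The check is expected to print True, since the difference between the two sides equals $\ell^2 (1 + s + 2\ell) / ((1+s)^3 (1+s+\ell))$, which is nonnegative whenever $\ell \ge -(1+s)/2$.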

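Our analysis will now naturally depend on the event $L_n \ge -(1+s)^\alpha$. Let us first upper bound the probability
of the complementary event:

\begin{align}
\Pr\big( L_n < -(1+s)^\alpha \mid S_n = s \big) &\le \Pr\big( |L_n| > (1+s)^\alpha \mid S_n = s \big) \notag \\
&= \Pr\big( |L_n|^{2+\delta} > (1+s)^{\alpha(2+\delta)} \mid S_n = s \big) \notag \\
&\le \frac{\mathbb{E}\big( |L_n|^{2+\delta} \mid S_n = s \big)}{(1+s)^{\alpha(2+\delta)}} \tag{22} \\
&= \frac{\mathbb{E}\big| L^{(2^{-s})} \big|^{2+\delta}}{(1+s)^{\alpha(2+\delta)}} \notag \\
&\le c_3 \cdot (1+s)^{-\alpha(2+\delta)} \tag{23}
\end{align}

for some $c_3 > 0$ and any $s$ large enough. We used Markov's inequality in (22), and (23) is again by virtue of
property (P2) of the family $\mathcal{F}$, which implies that $L^{(\lambda)}$ is uniformly bounded in $\mathcal{L}_{2+\delta}$ for all $\lambda > 0$ small enough.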
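Writing $t = 1 - \frac{1}{1+s}$, we have that for any $s$ sufficiently larger than $2^{\frac{1}{1-\alpha}} - 1$

\begin{align}
\mathbb{E}(T_{n+1} - T_n \mid T_n = t) &= \mathbb{E}\big( g(s, L_n) \mid S_n = s \big) \notag \\
&= \mathbb{E}\big( g(s, L^{(2^{-s})}) \big) \notag \\
&= \Pr\big( L^{(2^{-s})} < -(1+s)^\alpha \big) \cdot \mathbb{E}\big( g(s, L^{(2^{-s})}) \mid L^{(2^{-s})} < -(1+s)^\alpha \big) \notag \\
&\quad + \Pr\big( L^{(2^{-s})} \ge -(1+s)^\alpha \big) \cdot \mathbb{E}\big( g(s, L^{(2^{-s})}) \mid L^{(2^{-s})} \ge -(1+s)^\alpha \big) \notag \\
&\ge -c_3 \cdot (1+s)^{-\alpha(2+\delta)} \notag \\
&\quad + \Pr\big( L^{(2^{-s})} \ge -(1+s)^\alpha \big) \cdot \mathbb{E}\left( \frac{L^{(2^{-s})}}{(1+s)^2} \ \Big|\ L^{(2^{-s})} \ge -(1+s)^\alpha \right) \notag \\
&\quad - \Pr\big( L^{(2^{-s})} \ge -(1+s)^\alpha \big) \cdot \mathbb{E}\left( \frac{2 \big( L^{(2^{-s})} \big)^2}{(1+s)^3} \ \Big|\ L^{(2^{-s})} \ge -(1+s)^\alpha \right) \tag{24} \\
&\ge -c_3 \cdot (1+s)^{-\alpha(2+\delta)} + \big( 1 - c_3 \cdot (1+s)^{-\alpha(2+\delta)} \big) \cdot \frac{c_1}{(1+s)^2} - \frac{2 c_2}{(1+s)^3}. \tag{25}
\end{align}

(24) follows from (21), (23), and since $g(s,\ell) \ge -1$. (25) follows from (19), (20), and (23). Examining (25) for
any $\frac{2}{2+\delta} < \alpha < 1$, it is immediately clear that this lower bound on the expected increment is positive for all large
enough $s$, and hence for all $t$ sufficiently close to 1. This concludes the proof. ■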


After establishing that $S_n \to \infty$ a.s., we would like to further determine how fast this happens. To that end,
we will define a coupled process $S'_n$ that will be easier to handle, and will be stochastically smaller than $S_n$.
Loosely speaking, $S'_n$ will have two modes of i.i.d. random walk behavior corresponding to whether $S_n$ is above
or below the threshold $t$; it will also grow slower than $S_n$ in each of these regimes.

To do that, we first define two random variables $U, W$ that will be stochastically smaller than $L_n$ given that
$S_n$ is above or below the threshold $t$ respectively, and will later determine the increments of the coupled process
$S'_n$ in these two regimes. For brevity, we omit the dependence of $U, W$ on $t$. Recall the definition of $L^{(\lambda)}$ in (7).
We first define $\tilde{U}, \tilde{W}$ via their c.d.f.s as follows:

$$ \Pr(\tilde{U} < u) = \sup_{\lambda \in (0, 2^{-t}]} \Pr\big( L^{(\lambda)} < u \big), $$

$$ \Pr(\tilde{W} < w) = \sup_{\lambda \in (2^{-t}, 1)} \Pr\big( L^{(\lambda)} < w \big). $$

Now, setting some large number $\xi > 0$, we define $U, W$ as the truncation of $\tilde{U}, \tilde{W}$:

$$ U = \min(\tilde{U}, \xi), \qquad W = \min(\tilde{W}, \xi). $$

Again, the dependence on $\xi$ will be omitted for notational clarity. The following lemma describes some important
properties of $U$ and $W$. The proof is relegated to the appendix.

Lemma 11. The following properties hold:

(i) $U$ is stochastically smaller than $L_n$ given $S_{n-1} = t_0$ for any $t_0 \ge t$.

(ii) $W$ is stochastically smaller than $L_n$ given $S_{n-1} = t_0$ for any $t_0 < t$.

(iii) $\mathbb{E}U < I(X;Y)$ for any $t, \xi$.

(iv) $\lim_{\xi \to \infty} \lim_{t \to \infty} \mathbb{E}U = I(X;Y)$.

(v) $\mathbb{E}|W| < \infty$ for any $\xi, t > 0$.

We are now ready to define the coupled process $S'_n$. Let $\{U_n\}$ and $\{W_n\}$ be two i.i.d. sequences with
distributions $P_U$ and $P_W$ respectively, such that the processes $\{U_n\}, \{W_n\}, \{S_n\}$ are mutually independent.
Define $S'_n$ to be the random walk process generated by replacing the increments of the process $S_n$ with
$U$ or $W$ elements, according to whether $S_n$ is above or below the threshold. Precisely:

$$ S'_n = \sum_{k=1}^{n - N_{t,n}} U_k + \sum_{k=1}^{N_{t,n}} W_k. $$

Note that unlike $S_n$, the coupled process $S'_n$ can become negative, since $\Pr(\tilde{W} < 0) = 1$. Also, $S'_n$ does not
contain the fixed initialization term $L_0 = -\log(1 - p_e)$. The proof of the following lemma appears in the appendix.

Lemma 12. $S'_n$ is stochastically smaller than $S_n$.


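Let us now show that the probability that $S'_n$ falls below $n(I(X;Y) - \varepsilon)$ vanishes with $n$.

Lemma 13. $\lim_{n \to \infty} \Pr\big( S'_n > n(I(X;Y) - \varepsilon) \big) = 1$ for any $\varepsilon > 0$.

Proof: We write $I = I(X;Y)$ for short. Set $\xi$ and $t$ large enough so that

$$ I - \mathbb{E}U < \varepsilon/8, \tag{26} $$

which is possible by virtue of Lemma 11 claims (iv) and (iii). Then:

\begin{align}
\Pr\big( S'_n < n(I - \varepsilon) \big) &= \Pr\left( \sum_{k=1}^{n - N_{t,n}} U_k + \sum_{k=1}^{N_{t,n}} W_k < n(I - \varepsilon) \right) \notag \\
&\le \Pr(N_{t,n} > m) + \sum_{r=1}^{m} \Pr(N_{t,n} = r) \Pr\left( \sum_{k=1}^{n-r} U_k + \sum_{k=1}^{r} W_k < n(I - \varepsilon) \ \Big|\ N_{t,n} = r \right) \notag \\
&\le \delta(m) + \sum_{r=1}^{m} \Pr(N_{t,n} = r) \Pr\left( \sum_{k=1}^{n-r} U_k + \sum_{k=1}^{r} W_k < n(I - \varepsilon) \right) \tag{27} \\
&\le \delta(m) + \sum_{r=1}^{m} \Pr(N_{t,n} = r) \Pr\left( \sum_{k=1}^{n-r} U_k < n\Big( I - \frac{\varepsilon}{2} \Big) \ \vee\ \sum_{k=1}^{r} W_k < -\frac{n\varepsilon}{2} \right) \tag{28} \\
&\le \delta(m) + \sum_{r=1}^{m} \Pr(N_{t,n} = r) \left( \Pr\left( \sum_{k=1}^{n-r} U_k < n\Big( I - \frac{\varepsilon}{2} \Big) \right) + \Pr\left( \sum_{k=1}^{r} W_k < -\frac{n\varepsilon}{2} \right) \right). \tag{29}
\end{align}

(27) follows from Lemma 10 and since the sequences $\{U_n\}, \{W_n\}$ are mutually independent of $\{S_n\}$, hence of
$N_{t,n}$ as well. (28) follows from the union bound. Analyzing the first term inside the parenthesis in (29), we note
that for any $1 \le r \le m$ and $n > m$ large enough,

\begin{align}
\Pr\left( \sum_{k=1}^{n-r} U_k < n\Big( I - \frac{\varepsilon}{2} \Big) \right) &= \Pr\left( \frac{1}{n-r} \sum_{k=1}^{n-r} U_k < \frac{I - \varepsilon/2}{1 - r/n} \right) \notag \\
&\le \Pr\left( \frac{1}{n-r} \sum_{k=1}^{n-r} U_k < I - \frac{\varepsilon}{4} \right) \notag \\
&\le \Pr\left( \frac{1}{n-r} \sum_{k=1}^{n-r} U_k < \mathbb{E}U - \frac{\varepsilon}{8} \right) \tag{30} \\
&\to 0, \tag{31}
\end{align}

where (30) follows from (26), and (31) is by virtue of the law of large numbers. Furthermore,

\begin{align}
\Pr\left( \sum_{k=1}^{r} W_k < -\frac{n\varepsilon}{2} \right) &\le \Pr\left( \sum_{k=1}^{r} |W_k| > \frac{n\varepsilon}{2} \right) \notag \\
&\le \frac{2r}{n\varepsilon} \cdot \mathbb{E}|W| \tag{32} \\
&= O\big( n^{-1} \big), \tag{33}
\end{align}

where (32) follows from Markov's inequality, and (33) is by virtue of Lemma 11 property (v). We therefore obtain
that for any $m$ and $\varepsilon$ there are $t, \xi$ large enough such that

$$ \Pr\big( S'_n < n(I - \varepsilon) \big) \le \delta(m) + o(1), $$

where $\delta(m) \to 0$ as $m \to \infty$ and the $o(1)$ term vanishes as $n \to \infty$. Since we can fix $m$ arbitrarily large we have that

$$ \lim_{n \to \infty} \Pr\big( S'_n < n(I - \varepsilon) \big) = 0 $$

as desired. ■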

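Finally, combining Lemmas 12 and 13 with the definition of $R_n$, we obtain

\begin{align}
\lim_{n \to \infty} \Pr\big( R_n > I(X;Y) - \varepsilon \big) &= \lim_{n \to \infty} \Pr\big( S_n > n(I(X;Y) - \varepsilon) \big) \notag \\
&\ge \lim_{n \to \infty} \Pr\big( S'_n > n(I(X;Y) - \varepsilon) \big) \notag \\
&= 1, \notag
\end{align}

establishing the theorem.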


Appendix 

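Proof of Lemma 3: Define the function $\phi : \mathbb{R} \to \mathbb{R}$,

$$ \phi(x) = g'(x \bmod 1) \cdot \mathbb{1}(x \in [-1, 2]). $$

Let $M\phi(x)$ be the Hardy-Littlewood maximal function [10, Chapter 7] pertaining to $\phi(x)$, i.e.,

\begin{align}
M\phi(x) &= \sup_{\lambda > 0} \frac{1}{\lambda} \int_{x - \lambda/2}^{x + \lambda/2} |\phi(t)|\, dt \notag \\
&= \sup_{\lambda > 0} \mathbb{E}|\phi(x + Q_\lambda)|, \notag
\end{align}

where $Q_\lambda \sim \mathrm{Unif}([-\lambda/2, \lambda/2])$. For any $x \in [0,1)$ we can also write

\begin{align}
M\phi(x) &\ge \sup_{\lambda \in (0,1)} \mathbb{E}|\phi(x + Q_\lambda)| \notag \\
&= \sup_{\lambda \in (0,1)} \mathbb{E}\big| g'\big( (x + Q_\lambda) \bmod 1 \big) \big| \tag{34} \\
&= \mathcal{D}[g(x)], \tag{35}
\end{align}

where we have used Lemma 2 in (35). Hence

$$ \Pr\big( \mathcal{D}[g(X)] > a \big) \le \Pr\big( M\phi(X) > a \big). \tag{36} $$

The Hardy-Littlewood maximal inequality [10, Chapter 7] implies that for any $a > 0$, the following
measure-theoretic "generalized Markov inequality" holds:

\begin{align}
|\{x : M\phi(x) > a\}| &\le 3a^{-1} \int_{-\infty}^{\infty} |\phi(x)|\, dx \notag \\
&= 3a^{-1} \int_{-1}^{2} |\phi(x)|\, dx \notag \\
&= 9a^{-1} \int_{0}^{1} |g'(x)|\, dx. \notag
\end{align}

Thus, if $X \sim \mathrm{Unif}([0,1])$ then

$$ \Pr\big( M\phi(X) > a \big) \le 9a^{-1}\, \mathbb{E}|g'(X)|. \tag{37} $$

The proof now follows from (36) and (37). ■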

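Proof of Lemma 11:

(i)

\begin{align}
\Pr(L_n < u \mid S_{n-1} = t_0) &= \Pr\big( L^{(2^{-t_0})} < u \big) \notag \\
&\le \sup_{\lambda \in (0, 2^{-t}]} \Pr\big( L^{(\lambda)} < u \big) \notag \\
&= \Pr(\tilde{U} < u) \notag \\
&\le \Pr(U < u). \notag
\end{align}

(ii) Follows similarly.

(iii) Follows similarly to Lemma 8 claim (iv).

(iv) Follows similarly to Lemma 8 claim (iii).

(v) Write $g(v,y) = \frac{\partial}{\partial v} F^{-1}_{\Theta|Y}(v \mid y)$. Now,

\begin{align}
\mathbb{E}_Y \mathbb{E}_V |g(V,Y)| &= \mathbb{E}_Y \mathbb{E}_V\, g(V,Y) \notag \\
&= \mathbb{E}_Y \big( F^{-1}_{\Theta|Y}(1 \mid Y) - F^{-1}_{\Theta|Y}(0 \mid Y) \big) \notag \\
&= 1. \tag{38}
\end{align}

Now, let $w > 0$.

\begin{align}
\Pr(W < -w) &= \sup_{\lambda \in (2^{-t}, 1)} \Pr\big( L^{(\lambda)} < -w \big) \notag \\
&\le \sup_{\lambda \in (2^{-t}, 1)} \Pr\Big( \inf_{\lambda' \in (0,1)} L^{(\lambda')} < -w \Big) \notag \\
&= \Pr\Big( \log \sup_{\lambda' \in (0,1)} D_{\lambda'}\big[ F^{-1}_{\Theta|Y}(V \mid Y) \big] > w \Big) \notag \\
&= \Pr\big( \log \mathcal{D}\big[ F^{-1}_{\Theta|Y}(V \mid Y) \big] > w \big) \notag \\
&= \Pr\big( \mathcal{D}\big[ F^{-1}_{\Theta|Y}(V \mid Y) \big] > 2^w \big) \notag \\
&= \mathbb{E}_Y \Big( \Pr\big( \mathcal{D}\big[ F^{-1}_{\Theta|Y}(V \mid Y) \big] > 2^w \mid Y \big) \Big) \notag \\
&\le 9 \cdot 2^{-w} \cdot \mathbb{E}_Y \mathbb{E}_V |g(V,Y)| \tag{39} \\
&= 9 \cdot 2^{-w}, \tag{40}
\end{align}

where in (39) we have used Lemma 3 together with property (P1), and (40) follows from (38). Now,

\begin{align}
\mathbb{E}|W| &= \mathbb{E}\left( \int_0^\infty \mathbb{1}(W > w)\, dw + \int_0^\infty \mathbb{1}(W < -w)\, dw \right) \notag \\
&= \int_0^\infty \Pr(W > w)\, dw + \int_0^\infty \Pr(W < -w)\, dw \notag \\
&\le \int_0^\infty \mathbb{1}(w < \xi)\, dw + \int_0^\infty 9 \cdot 2^{-w}\, dw \notag \\
&= \xi + 9 \log e. \notag
\end{align}

Note that the bound is independent of $t$. ■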

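Proof of Lemma 12: Let $A_n = \mathbb{1}(S_n < t)$. For any $\mu$:

\begin{align}
\Pr(S_n < \mu) &\le \Pr\left( \sum_{k=1}^{n} L_k < \mu \right) \notag \\
&= \mathbb{E}_{L^{n-1}} \Pr\left( \sum_{k=1}^{n} L_k < \mu \ \Big|\ L^{n-1} \right) \notag \\
&= \mathbb{E}_{L^{n-1}} \Pr\left( L_n < \mu - \sum_{k=1}^{n-1} L_k \ \Big|\ L^{n-1} \right) \notag \\
&\le \mathbb{E}_{L^{n-1}} \Pr\left( (1 - A_{n-1}) U_1 + A_{n-1} W_1 < \mu - \sum_{k=1}^{n-1} L_k \ \Big|\ L^{n-1} \right) \tag{41} \\
&= \Pr\left( (1 - A_{n-1}) U_1 + A_{n-1} W_1 + \sum_{k=1}^{n-1} L_k < \mu \right), \notag
\end{align}

where (41) follows since $(U_1, W_1)$ are independent of $L^{n-1}$ and by virtue of the stochastic lower bound properties
(i) and (ii) in Lemma 11, according to whether $U_1$ or $W_1$ is selected by $A_{n-1}$. Iterating the same argument we
obtain

\begin{align}
\Pr(S_n < \mu) &\le \Pr\left( \sum_{k=1}^{n} (1 - A_{n-k}) U_k + \sum_{k=1}^{n} A_{n-k} W_k < \mu \right) \notag \\
&= \Pr(S'_n < \mu), \notag
\end{align}

where the last equality follows by noting that $A_k = N_{t,k} - N_{t,k-1}$. This concludes the proof of the Lemma. ■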